confidence measure
Calibrated Structured Prediction
Volodymyr Kuleshov, Percy S. Liang
In user-facing applications, displaying calibrated confidence measures (probabilities that correspond to true frequencies) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
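The recalibration idea above (learning a mapping from a model's raw probabilities to calibrated ones) can be illustrated with histogram binning, used here as a simple stand-in for the paper's binary-classifier method; the data below is invented for illustration.

```python
from collections import defaultdict

def fit_histogram_binning(raw_probs, correct, n_bins=10):
    """Learn a mapping from raw probability to empirical frequency.

    raw_probs: model-reported probabilities for some event of interest
               (e.g. a marginal being correct); correct: 0/1 outcomes.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, y in zip(raw_probs, correct):
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # calibrated value per bin = empirical frequency of the event
    return {b: sums[b] / counts[b] for b in counts}

def recalibrate(p, mapping, n_bins=10):
    b = min(int(p * n_bins), n_bins - 1)
    return mapping.get(b, p)  # fall back to the raw prob for empty bins

# toy data: an overconfident model (reports ~0.9, right only 60% of the time)
raw = [0.92, 0.95, 0.91, 0.93, 0.94]
out = [1, 1, 1, 0, 0]
m = fit_histogram_binning(raw, out)
print(recalibrate(0.94, m))  # 0.6, the empirical frequency in that bin
```

The paper's recalibrator generalizes this by replacing the per-bin lookup with a trained binary classifier over richer features of the structured output.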
Beyond the Hook: Predicting Billboard Hot 100 Chart Inclusion with Machine Learning from Streaming, Audio Signals, and Perceptual Features
The advent of digital streaming platforms has recently revolutionized the landscape of the music industry, with the ensuing digitalization providing structured data collections that open new research avenues for investigating popularity dynamics and mainstream success. The present work explored which determinants hold the strongest predictive influence on a track's inclusion in the Billboard Hot 100 charts, including streaming popularity, measurable audio signal attributes, and probabilistic indicators of human listening. The analysis revealed that popularity was by far the most decisive predictor of Billboard Hot 100 inclusion, with considerable contributions from instrumentalness, valence, duration, and speechiness. Logistic Regression achieved 90.0% accuracy, with very high recall for charting singles (0.986) but lower recall for non-charting ones (0.813), yielding balanced F1-scores around 0.90. Random Forest slightly improved performance to 90.4% accuracy, maintaining near-perfect precision for non-charting singles (0.990) and high recall for charting ones (0.992), with F1-scores up to 0.91. Gradient Boosting (XGBoost) reached 90.3% accuracy, delivering a more balanced trade-off by improving recall for non-charting singles (0.837) while sustaining high recall for charting ones (0.969), resulting in F1-scores comparable to the other models.
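The F1-scores quoted above follow from precision and recall as their harmonic mean, which is why near-perfect performance on one side of the trade-off still yields F1 around 0.89-0.91 when the other side sits in the low 0.80s. A quick check with illustrative values (not figures from the paper):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# hypothetical values: near-perfect precision, recall in the low 0.80s
print(round(f1(0.99, 0.81), 3))  # 0.891 -- the weaker side dominates
print(round(f1(0.90, 0.90), 3))  # 0.9   -- balanced inputs pass through
```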
Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Kang, Zhewei, Zhao, Xuandong, Song, Dawn
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size $N$, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
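One plausible formalization of a reward-free confidence score of this kind (a sketch, not necessarily the paper's exact definition) scores a response by the average peakedness of its per-token output distributions, then picks the most confident of the N candidates:

```python
import math

def self_certainty(token_dists):
    """Average negative entropy of the per-token distributions.

    token_dists: one probability distribution per generated token;
    peaked distributions => higher score => more confident output.
    """
    def neg_entropy(p):
        return sum(q * math.log(q) for q in p if q > 0)
    return sum(neg_entropy(p) for p in token_dists) / len(token_dists)

def best_of_n(candidates):
    """candidates: (response_text, token_dists) pairs; keep the max score."""
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]

peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # confident generation
flat   = [[0.25, 0.25, 0.25, 0.25]] * 3   # uncertain generation
print(best_of_n([("answer A", flat), ("answer B", peaked)]))  # answer B
```

Unlike a reward model, this only reads probabilities the LLM already produces, which is why it adds no extra model calls as N grows.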
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
Liu, Xiaoou, Lin, Zhen, Da, Longchao, Chen, Chacha, Trivedi, Shubhendu, Wei, Hua
Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy and expensive, and may introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
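With gold 0/1 correctness labels in hand, evaluating a confidence measure reduces to a ranking check such as AUROC; a minimal sketch with invented scores:

```python
def auroc(confidences, correct):
    """Probability that a randomly chosen correct answer receives higher
    confidence than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# gold labels (e.g. from multiple-choice answers) vs. confidence scores
conf = [0.9, 0.8, 0.4, 0.3, 0.7]
gold = [1,   1,   0,   0,   1]
print(auroc(conf, gold))  # 1.0: every correct answer outranks every error
```

The point of using multiple-choice gold labels is that `gold` here is exact, rather than the output of a noisy correctness heuristic.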
Reviews: A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
The study addresses detecting abnormal inputs to deep neural networks: out-of-distribution inputs, adversarial inputs, and new classes (for class-incremental learning). To achieve this, the authors fit class-conditional Gaussian distributions with a tied covariance (as in linear discriminant analysis) to the features at various stages of a target neural network, constructing distributions over the valid inputs (inliers). They use the Mahalanobis distance under these Gaussians as a confidence measure (proportional to the log-likelihood). They further enhance the confidence measure by taking Fast Gradient Sign Method-style steps in the input space to increase the score. Finally, they combine the scores gathered at different layers of the network through a linear combination.
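The core scoring step at a single layer can be sketched as follows, with tiny 2-D feature vectors standing in for penultimate-layer activations and a diagonal tied covariance for simplicity (the paper uses a full tied covariance):

```python
def fit_class_gaussians(features, labels):
    """Per-class means plus a single (tied) diagonal covariance,
    estimated from feature vectors of the training data."""
    classes = sorted(set(labels))
    dim = len(features[0])
    means = {}
    for c in classes:
        xs = [x for x, y in zip(features, labels) if y == c]
        means[c] = [sum(col) / len(xs) for col in zip(*xs)]
    # tied covariance: pooled over classes, diagonal in this sketch
    var = [0.0] * dim
    for x, y in zip(features, labels):
        for d in range(dim):
            var[d] += (x[d] - means[y][d]) ** 2
    var = [v / len(features) for v in var]
    return means, var

def confidence(x, means, var):
    """Negative min squared Mahalanobis distance to any class mean;
    higher means more in-distribution (proportional to log-likelihood)."""
    def m2(x, mu):
        return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return -min(m2(x, mu) for mu in means.values())

feats = [[0.0, 0.1], [0.2, -0.1], [2.0, 2.1], [1.8, 1.9]]
labels = [0, 0, 1, 1]
means, var = fit_class_gaussians(feats, labels)
# a point near a class mean scores higher than a far-away outlier
print(confidence([0.1, 0.0], means, var) > confidence([9.0, 9.0], means, var))
```

The method then repeats this at several layers and learns a linear combination of the per-layer scores.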
Towards interfacing large language models with ASR systems using confidence measures and prompting
Naderi, Maryam, Hermann, Enno, Nanchen, Alexandre, Hovsepyan, Sevada, Magimai-Doss, Mathew
As large language models (LLMs) grow in parameter size and capabilities, such as interaction through prompting, they open up new ways of interfacing with automatic speech recognition (ASR) systems beyond rescoring n-best lists. This work investigates post-hoc correction of ASR transcripts with LLMs. To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
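The confidence-based filtering described above amounts to routing only low-confidence hypotheses to the LLM; a minimal sketch, where the threshold and the correction callable are illustrative assumptions:

```python
def correct_transcripts(transcripts, confidences, llm_correct, threshold=0.8):
    """Send only low-confidence ASR hypotheses to the LLM for correction,
    leaving likely-accurate transcripts untouched to avoid new errors."""
    out = []
    for text, conf in zip(transcripts, confidences):
        out.append(llm_correct(text) if conf < threshold else text)
    return out

# hypothetical stand-in for an LLM prompted to fix ASR errors
fake_llm = lambda t: t.replace("wreck a nice beach", "recognize speech")

hyps = ["how to wreck a nice beach", "the weather is sunny today"]
confs = [0.55, 0.97]
print(correct_transcripts(hyps, confs, fake_llm))
```

Only the first hypothesis (confidence 0.55) is rewritten; the high-confidence one passes through unchanged, which is the mechanism the paper uses to protect already-accurate transcripts.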
Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
Lin, Zhen, Trivedi, Shubhendu, Sun, Jimeng
The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks. For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confidence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance, in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally, different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence probability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set, we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC or AUARC.
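The reweighting idea can be sketched in a few lines, with made-up attention weights standing in for the ones CSL elicits from the base LLM's attention heads:

```python
def sequence_logprob(token_logprobs):
    """Vanilla confidence: mean token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def weighted_logprob(token_logprobs, attention):
    """CSL-style confidence: attention-weighted mean, so contextually
    important tokens dominate and filler tokens matter less."""
    z = sum(attention)
    return sum(w / z * lp for w, lp in zip(attention, token_logprobs))

# "the answer is , um , Paris": fillers get low probability but also
# low (hypothetical) attention; the key token carries most of the weight
logps = [-0.1, -0.2, -0.1, -2.0, -3.0, -2.0, -0.05]
attn  = [0.05, 0.05, 0.05, 0.01, 0.01, 0.01, 0.82]
print(sequence_logprob(logps))        # dragged down by the awkward fillers
print(weighted_logprob(logps, attn))  # dominated by the confident key token
```

The awkwardly phrased but correct answer scores much higher under the weighted measure, which is exactly the QA failure mode the abstract describes for vanilla sequence likelihood.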